18.6 Noise-Contrastive Estimation

Noise-contrastive estimation (NCE) represents the probability distribution estimated by the model explicitly as

\[\log p_{model}(x) = \log \hat{p}(x; \theta) + c\]

where c is explicitly introduced as an approximation of \(-\log Z(\theta)\). Rather than estimating only \(\theta\), the NCE procedure treats c as just another parameter and estimates \(\theta\) and c simultaneously, using the same algorithm for both. The resulting \(\log p_{model}(x)\) may not correspond exactly to a valid probability distribution, but it becomes closer and closer to being valid as the estimate of c improves.
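As a minimal sketch of this parameterization (all names here are hypothetical, assuming a PyTorch setting), c can be implemented as one extra scalar parameter trained alongside \(\theta\):

```python
import torch
import torch.nn as nn

class UnnormalizedModel(nn.Module):
    """Sketch: log p_model(x) = log p_hat(x; theta) + c, with c a learned scalar."""
    def __init__(self, dim):
        super().__init__()
        # theta: any network producing an unnormalized log-density log p_hat(x; theta)
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, 1))
        # c: a single scalar standing in for -log Z(theta), treated like any other parameter
        self.c = nn.Parameter(torch.zeros(()))

    def log_prob(self, x):
        return self.net(x).squeeze(-1) + self.c
```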

NCE works by reducing the unsupervised learning problem of estimating p(x) to that of learning a probabilistic binary classifier in which one of the categories corresponds to the data generated by the model. Specifically, we introduce a noise distribution \(p_{noise}(x)\) that is tractable to evaluate and to sample from. We can now construct a model over both x and a new, binary class variable y. In the new joint model:

\[\begin{split}p_{joint}(y=1) = \frac{1}{2} \\ p_{joint}(x|y=1) = p_{model}(x) \\ p_{joint}(x|y=0) = p_{noise}(x)\end{split}\]

y is a switch variable that determines whether we will generate x from the model or from the noise distribution.

We can construct a similar joint model of training data. In this case, the switch variable determines whether we draw x from the data or from the noise.

\[\begin{split}p_{train}(y=1) = \frac{1}{2} \\ p_{train}(x|y=1) = p_{data}(x) \\ p_{train}(x|y=0) = p_{noise}(x)\end{split}\]
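A sketch of drawing supervised training pairs from \(p_{train}\) under these definitions (hypothetical helper; `noise_dist` is assumed to be a `torch.distributions` object): half of each batch is real data with y = 1, half is noise with y = 0.

```python
import torch

def make_train_batch(x_data, noise_dist):
    # y = 1: examples drawn from the data; y = 0: samples drawn from the noise distribution
    x_noise = noise_dist.sample((x_data.shape[0],))
    x = torch.cat([x_data, x_noise], dim=0)
    y = torch.cat([torch.ones(len(x_data)), torch.zeros(len(x_noise))])
    return x, y
```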

We can now just use standard maximum likelihood learning on the supervised learning problem of fitting \(p_{joint}\) to \(p_{train}\):

\[\theta, c = \mathop{\arg\max}_{\theta, c} E_{x, y \sim p_{train}} \log p_{joint}(y \mid x)\]

Here \(p_{joint}\) is essentially a logistic regression model applied to the difference in log probabilities of the model and the noise distribution:

\[\begin{split}p_{joint}(y=1 \mid x) &= \frac{p_{model}(x)}{p_{model}(x) + p_{noise}(x)} \\ &= \frac{1}{1 + \exp\left(\log \frac{p_{noise}(x)}{p_{model}(x)}\right)} \\ &= \sigma\left(\log p_{model}(x) - \log p_{noise}(x)\right)\end{split}\]
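In code, the classifier's logit is exactly this log-difference, so fitting \(p_{joint}\) to \(p_{train}\) reduces to a binary cross-entropy loss (a sketch continuing the hypothetical helpers above; `noise_dist.log_prob` is assumed to return a per-example joint log-density):

```python
import torch.nn.functional as F

def nce_loss(model, noise_dist, x, y):
    # logit = log p_model(x) - log p_noise(x), so sigmoid(logit) = p_joint(y=1 | x)
    logits = model.log_prob(x) - noise_dist.log_prob(x)
    # Maximizing E[log p_joint(y | x)] is the same as minimizing this binary cross-entropy
    return F.binary_cross_entropy_with_logits(logits, y)
```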

NCE is simple to apply as long as (see the sketch after this list):

  • \(\hat{p}_{model}\) is easy to back-propagate through
  • \(p_{noise}(x)\) is easy to evaluate, in order to evaluate \(p_{joint}(y \mid x)\)
  • \(p_{noise}(x)\) is easy to sample from, in order to generate training data
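Putting the pieces together, a rough end-to-end sketch under these three conditions, with a standard normal noise distribution (easy to evaluate and sample) and synthetic stand-in data; the optimizer updates \(\theta\) and c jointly:

```python
import torch
from torch.distributions import Independent, Normal

dim = 2
model = UnnormalizedModel(dim)                        # from the earlier sketch
noise_dist = Independent(Normal(torch.zeros(dim), torch.ones(dim)), 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # updates theta and c together

x_data = torch.randn(512, dim) * 0.5 + 1.0            # stand-in for real training data

for step in range(1000):
    x, y = make_train_batch(x_data, noise_dist)
    loss = nce_loss(model, noise_dist, x, y)
    opt.zero_grad()
    loss.backward()                                   # back-propagates through log p_hat and c
    opt.step()
```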

When to use noise-contrastive estimation:

  • NCE is most successful when applied to problems with few random variables, but it can work well even if those random variables can take on many values, e.g., modeling the conditional distribution over a word given the context of the word.
  • It is less efficient when applied to problems with many random variables: the logistic regression classifier can reject a noise sample by identifying any one variable whose value is unlikely, which means learning slows down greatly after \(p_{model}\) has learned the basic marginal statistics.

The constraint that \(p_{noise}\) must be easy to evaluate and easy to sample from can be overly restrictive. When \(p_{noise}\) is too simple, most samples are likely to be too obviously distinct from the data to force \(p_{model}\) to improve noticeably.

Like score matching and pseudolikelihood, NCE does not work if only a lower bound on \(\hat{p}\) is available.

When the model distribution is copied to define a new noise distribution before each gradient step, NCE defines a procedure called self-contrastive estimation, whose expected gradient is equivalent to the expected gradient of maximum likelihood.

  • The special case of NCE where the noise samples are those generated by the model suggests that maximum likelihood can be interpreted as a procedure that forces a model to constantly learn to distinguish reality from its own evolving beliefs, while noise-contrastive estimation achieves some reduced computational cost by only forcing the model to distinguish reality from a fixed baseline (the noise model).
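A conceptual sketch of self-contrastive estimation, assuming a model that is tractable both to evaluate and to sample from (here a hypothetical learnable diagonal Gaussian); for a general unnormalized model, drawing the noise samples from the model itself is the expensive step. Before each gradient step, the current model is frozen and reused as the noise distribution:

```python
import torch
import torch.nn.functional as F
from torch.distributions import Independent, Normal

# Hypothetical tractable model: a diagonal Gaussian we can both evaluate and sample from.
loc = torch.nn.Parameter(torch.zeros(2))
log_scale = torch.nn.Parameter(torch.zeros(2))
opt = torch.optim.Adam([loc, log_scale], lr=1e-2)

x_data = torch.randn(512, 2) * 0.5 + 1.0              # stand-in for real training data

for step in range(1000):
    # Freeze a copy of the current model to serve as this step's noise distribution.
    noise_dist = Independent(Normal(loc.detach().clone(), log_scale.detach().exp()), 1)
    model_dist = Independent(Normal(loc, log_scale.exp()), 1)
    x, y = make_train_batch(x_data, noise_dist)        # helper from the earlier sketch
    logits = model_dist.log_prob(x) - noise_dist.log_prob(x)
    loss = F.binary_cross_entropy_with_logits(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```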